Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Calico to v3.19.2 #1583

Merged
merged 3 commits into from
Aug 10, 2021

Conversation

CecileRobertMichon
Copy link
Contributor

@CecileRobertMichon CecileRobertMichon commented Aug 4, 2021

What type of PR is this?

What this PR does / why we need it: Updates the Calico manifests to use v3.20.0 which fixes projectcalico/felix#2811 (resulting in DNS issues with NMI pod when using Azure-hosted management cluster with Calico).

As a result, undo the workarounds in E2E for private clusters and self-hosted clusters to keep using env vars for identity, they can now use AzureClusterIdentity like the other flavors.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1448, #1353

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Fix DNS issues with VXLAN Calico when using AzureClusterIdentity (Updated Calico to v3.19.2)

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 4, 2021
@CecileRobertMichon
Copy link
Contributor Author

/test ls

@k8s-ci-robot
Copy link
Contributor

@CecileRobertMichon: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-azure-test
  • /test pull-cluster-api-provider-azure-build
  • /test pull-cluster-api-provider-azure-e2e
  • /test pull-cluster-api-provider-azure-e2e-windows
  • /test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-azure-e2e-full
  • /test pull-cluster-api-provider-azure-e2e-exp
  • /test pull-cluster-api-provider-azure-capi-e2e
  • /test pull-cluster-api-provider-azure-conformance-v1alpha4
  • /test pull-cluster-api-provider-azure-upstream-v1alpha4-windows
  • /test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-windows-upstream-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-apidiff
  • /test pull-cluster-api-provider-azure-coverage

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-azure-test
  • pull-cluster-api-provider-azure-build
  • pull-cluster-api-provider-azure-e2e
  • pull-cluster-api-provider-azure-e2e-windows
  • pull-cluster-api-provider-azure-verify
  • pull-cluster-api-provider-azure-apidiff
  • pull-cluster-api-provider-azure-coverage

In response to this:

/test ls

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 4, 2021
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Aug 4, 2021
@CecileRobertMichon
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-full

1 similar comment
@CecileRobertMichon
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-full

@CecileRobertMichon CecileRobertMichon changed the title Update Calico to v3.20 Update Calico to v3.19.2 Aug 4, 2021
@CecileRobertMichon
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-full

@CecileRobertMichon
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-full

@nader-ziada
Copy link
Contributor

/retest

@nader-ziada
Copy link
Contributor

@CecileRobertMichon There is a change we need to do in our calico.yaml here

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/templates/addons/calico.yaml#L3775-L3776

- name: FELIX_FEATUREDETECTOVERRIDE
  value: "MASQFullyRandom=false"

to

- name: FELIX_FEATUREDETECTOVERRIDE
  value: "ChecksumOffloadBroken=true"

got this from here: https://github.com/projectcalico/felix/blob/31a1f2adfdec9fbe3552d6700a062a7e98c2279e/iptables/feature_detect.go#L60 and here: projectcalico/calico#3145 (comment)

3.20.0 worked for me locally with this change

@nader-ziada
Copy link
Contributor

@CecileRobertMichon There is a change we need to do in our calico.yaml here

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/templates/addons/calico.yaml#L3775-L3776

- name: FELIX_FEATUREDETECTOVERRIDE
  value: "MASQFullyRandom=false"

to

- name: FELIX_FEATUREDETECTOVERRIDE
  value: "ChecksumOffloadBroken=true"

got this from here: https://github.com/projectcalico/felix/blob/31a1f2adfdec9fbe3552d6700a062a7e98c2279e/iptables/feature_detect.go#L60 and here: projectcalico/calico#3145 (comment)

3.20.0 worked for me locally with this change

otherwise, all the changes look good 👍

@CecileRobertMichon CecileRobertMichon changed the title Update Calico to v3.19.2 Update Calico to v3.20.0 Aug 5, 2021
@CecileRobertMichon
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-full

@CecileRobertMichon
Copy link
Contributor Author

/retest

@CecileRobertMichon
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-windows
/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Copy link
Contributor Author

/retest

@CecileRobertMichon
Copy link
Contributor Author

@nader-ziada Windows and IPv6 clusters are passing, all the others are failing with curl job issues (most ILB, one ELB). Seems like something with the Calico is causing these jobs to fail. I have seen some flakes recently for ILB jobs but not 100% of the time like here.

Also, the public management cluster fails to initialize.

@CecileRobertMichon
Copy link
Contributor Author

I see

2021-08-05 23:07:53.628 [WARNING][57] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-08-05 23:07:53.628 [WARNING][57] felix/health.go 188: Reporter is not ready. name="async_calc_graph"
2021-08-05 23:07:54.526 [INFO][57] felix/watchercache.go 174: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices"
2021-08-05 23:07:54.531 [INFO][57] felix/kubeendpointslice.go 101: Unable to list K8s EndpointSlice resources error=endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope
2021-08-05 23:07:54.531 [INFO][57] felix/watchercache.go 186: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=connection is unauthorized: endpointslices.discovery.k8s.io is forbidden: User "system:serviceaccount:kube-system:calico-node" cannot list resource "endpointslices" in API group "discovery.k8s.io" at the cluster scope

In Calico logs, I suspect we need some of the newer changes in the manifests

@CecileRobertMichon
Copy link
Contributor Author

reverted back to 3.19.2 given projectcalico/calico#4810

3.19.2 should have the DNS fix as well, going to try to get the tests passing and we can tackle 3.20 separately.

@CecileRobertMichon CecileRobertMichon changed the title Update Calico to v3.20.0 Update Calico to v3.19.2 Aug 5, 2021
@CecileRobertMichon
Copy link
Contributor Author

all tests passed before pushing MTU update

@CecileRobertMichon
Copy link
Contributor Author

/test pull-cluster-api-provider-azure-e2e-full

@nader-ziada
Copy link
Contributor

to make sure self-hosted test is good

/test pull-cluster-api-provider-azure-capi-e2e

@nader-ziada
Copy link
Contributor

/test pull-cluster-api-provider-azure-capi-e2e

@nader-ziada
Copy link
Contributor

the self-hosted test passed, but there are some failures related to upgrades which is probably not related to this PR

@CecileRobertMichon
Copy link
Contributor Author

@nader-ziada the upgrade test is currently broken, see #1555 and #1566

# We set this value to 1350 for “physical network MTU size minus 50” since we use VXLAN, which uses a 50-byte header.
# If enabling Wireguard, this value should be changed to 1340 (Wireguard uses a 60-byte header).
# https://docs.projectcalico.org/networking/mtu#determine-mtu-size
veth_mtu: "1350"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/assign @jsturtevant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@CecileRobertMichon
Copy link
Contributor Author

@nader-ziada I think this is ready to go

@nader-ziada
Copy link
Contributor

yeah, was justing for James' feedback

/lgtm
/approve

Thanks @CecileRobertMichon

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 9, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nader-ziada

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 9, 2021
@nader-ziada
Copy link
Contributor

/test pull-cluster-api-provider-azure-e2e

@CecileRobertMichon
Copy link
Contributor Author

/retest

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Aug 10, 2021

@CecileRobertMichon: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Rerun command
pull-cluster-api-provider-azure-capi-e2e 81c417e link /test pull-cluster-api-provider-azure-capi-e2e

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@nader-ziada
Copy link
Contributor

/test pull-cluster-api-provider-azure-e2e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DNS issues when using AzureIdentity
4 participants